Structuring Dockerfiles to improve build times
I was browsing through some of my old ADRs recently and was reminded of some research I'd done into building Docker images. In this case I was trying to create a container image for building an older application, which required third-party dependencies and lots of tweaking. This meant that I was spending a lot of time tweaking the Dockerfile and even more time waiting for the image to build, which quickly became very frustrating. I managed to reduce build times significantly by changing how I structured my Dockerfile, and this article is a discussion of my thinking.
This post isn't intended to be an introduction to this technology; there's a great introduction to Docker on their site if you need one.
What's the problem?
When you issue a docker build command, Docker follows the instructions in your Dockerfile and builds an image. Let's think about the following set of instructions:
ADD ".\foo.exe" .
RUN start /wait .\foo.exe
RUN del foo.exe
ADD ".\bar.ps1" .
Firstly, we add the file foo.exe to the image, then we run it, then we delete it. Finally, we add the bar.ps1 file. This should all build reasonably quickly. Now let's imagine that foo.exe is 500 MB and that running the install takes 15 minutes. Without caching, each time you tweak the contents of bar.ps1 and rebuild, you have to wait 15 minutes for the build to complete. Enter caching...
Caching
Obviously, we don't want to be waiting around all that time for a simple script change, so Docker uses build-time caching for ADD, COPY, and RUN commands. Let's walk through how this works using the previous example:
Add foo.exe to the image and add the result to the cache:
ADD ".\foo.exe" .
Run foo.exe and add the result to the cache:
RUN start /wait .\foo.exe
Delete foo.exe and add the result to the cache:
RUN del foo.exe
Add bar.ps1 to the image and add the result to the cache:
ADD ".\bar.ps1" .
So now what happens if we change bar.ps1 and rebuild? Docker will recognise that foo.exe and the first three instructions haven't changed, so it reuses the first three layers and only needs to rebuild the final one.
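The walkthrough above can be modelled in a few lines of Python. This is only a rough sketch of the idea, not Docker's actual implementation: the `layer_keys` and `rebuilt_layers` helpers are hypothetical names, and hashing the file contents is an assumed stand-in for whatever file checks Docker really performs. The point it illustrates is that each layer's cache key depends on the key of the layer before it.

```python
import hashlib


def layer_keys(steps, files):
    """Return one cache key per step; each key chains in its parent's key."""
    keys = []
    parent = ""
    for instruction, source in steps:
        # Only ADD steps have a source file; RUN steps are keyed on their text.
        content = files.get(source, b"") if source else b""
        data = parent.encode() + instruction.encode() + content
        parent = hashlib.sha256(data).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(steps, old_files, new_files):
    """Indexes (0-based) of steps whose cache key differs between two builds."""
    old_keys = layer_keys(steps, old_files)
    new_keys = layer_keys(steps, new_files)
    return [i for i, (a, b) in enumerate(zip(old_keys, new_keys)) if a != b]


# The four instructions from the example, paired with the file each one reads.
steps = [
    ('ADD ".\\foo.exe" .', "foo.exe"),
    ("RUN start /wait .\\foo.exe", None),
    ("RUN del foo.exe", None),
    ('ADD ".\\bar.ps1" .', "bar.ps1"),
]

# Made-up file contents: only bar.ps1 changes between the two builds.
before = {"foo.exe": b"installer", "bar.ps1": b"script v1"}
after = {"foo.exe": b"installer", "bar.ps1": b"script v2"}

print(rebuilt_layers(steps, before, after))  # [3]
```

Only the last step (index 3) gets a new key, so in this model the slow install is served from the cache.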
Layering considerations
Overall, this caching strategy does help to massively improve build performance, so we're done, right? Not quite. I mentioned the word cache earlier and, as we all know, caches get invalidated. In the case of Docker this happens when a previous layer changes. Let's consider this alternative layout of my original Dockerfile, where I've moved the ADD ".\bar.ps1" . command to the top:
ADD ".\bar.ps1" .
ADD ".\foo.exe" .
RUN start /wait .\foo.exe
RUN del foo.exe
Obviously, each command is going to create a layer in the cache, as we discussed earlier, and if I change bar.ps1 every layer that comes after it will be invalidated and need to be rebuilt, including the 15-minute install. Given this, we still need to consider how we structure our Dockerfiles.
Let's consider the following file layout:
# layer 1
ADD ".\foo.ps1" .
# layer 2
ADD ".\bar.exe" .
# layer 3
RUN start /wait powershell .\foo.ps1
# layer 4
RUN start /wait bar.exe /q
# layer 5
RUN del foo.ps1
# layer 6
RUN del bar.exe
As a long-time software engineer this was my initial approach: grouping operations together by type (adding the files, running the commands, deleting the files). As we've just learnt, each of these operations will generate a new layer in the cache, and any change to bar.exe, for example, will invalidate all of the layers that follow it, pushing us back to longer build times.
So now let's think about an alternative layout, where we group around the logical layering of the image: copy, install, and delete foo; copy, install, and delete bar. Changing bar.exe no longer invalidates the foo layers, so their cached versions can be reused at build time:
# grouping 1
ADD ".\foo.ps1" .
RUN start /wait powershell .\foo.ps1
RUN del foo.ps1
# grouping 2
ADD ".\bar.exe" .
RUN start /wait bar.exe /q
RUN del bar.exe
You can also optimise a little further, as I did, to reduce the overall number of cache layers by chaining each install and its clean-up into a single RUN. In this case I ended up with just four:
# escape=`
# layer 1
ADD ".\foo.ps1" .
# layer 2
RUN start /wait powershell .\foo.ps1 `
&& del foo.ps1
# layer 3
ADD ".\bar.exe" .
# layer 4
RUN start /wait bar.exe /q `
&& del bar.exe
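To make the difference between grouping by type and grouping by component concrete, here is a small Python sketch. It is a deliberately simplified model rather than anything Docker exposes: `split_on_change` is a hypothetical helper, and it assumes a step is invalidated when the file it adds changes, with every later step rebuilt too.

```python
def split_on_change(steps, changed_file):
    """Split steps into (cached, rebuilt) when a single source file changes.

    Assumption: the first step that reads the changed file is invalidated,
    and every step after it has to be rebuilt as well.
    """
    for index, (_, source) in enumerate(steps):
        if source == changed_file:
            return steps[:index], steps[index:]
    return steps, []


# Layout grouped by operation type: add everything, run everything, delete.
by_type = [
    ('ADD ".\\foo.ps1" .', "foo.ps1"),
    ('ADD ".\\bar.exe" .', "bar.exe"),
    ("RUN start /wait powershell .\\foo.ps1", None),
    ("RUN start /wait bar.exe /q", None),
    ("RUN del foo.ps1", None),
    ("RUN del bar.exe", None),
]

# Layout grouped by component: everything for foo, then everything for bar.
by_component = [
    ('ADD ".\\foo.ps1" .', "foo.ps1"),
    ("RUN start /wait powershell .\\foo.ps1", None),
    ("RUN del foo.ps1", None),
    ('ADD ".\\bar.exe" .', "bar.exe"),
    ("RUN start /wait bar.exe /q", None),
    ("RUN del bar.exe", None),
]

for name, layout in (("by type", by_type), ("by component", by_component)):
    cached, rebuilt = split_on_change(layout, "bar.exe")
    print(f"{name}: {len(cached)} cached, {len(rebuilt)} rebuilt")
# by type: 1 cached, 5 rebuilt
# by component: 3 cached, 3 rebuilt
```

In the by-type layout a change to bar.exe forces the foo.ps1 install to run again, whereas in the by-component layout all three foo steps stay cached.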
In summary
Obviously, this isn't a silver bullet; pretty much everything in software engineering is a trade-off. The key takeaway here should be that build caching exists and you can use it to your advantage. You just need to think about your Dockerfile, and the nature of the image you are building, to leverage it to its fullest. If you want to do some in-depth reading, there's some great documentation on the Docker site that goes into more detail.
What next?
I don't have a comment section on my blog at the moment, but I'm always happy to chat on Mastodon.